
more catching of build errors, dump of logs, etc.; add registry pod disk usage analysis #8591

Merged
merged 1 commit into openshift:master from buildExtTestDbg on May 7, 2016

Conversation

gabemontero
Contributor


@bparees @csrwng FYI - next round of debugging the extended-test registry disk quota error ... this catches some more tests' build failures and adds running of df/du inside the registry pod.
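For reference, a minimal sketch of the kind of df/du dump being added here; the pod label selector and the /registry mount path are assumptions about the default docker-registry deployment, not the exact code in this PR:

```bash
# Sketch only: capture disk usage from inside the registry pod after a failure.
# The deploymentconfig label and the /registry mount path are assumptions.
REGISTRY_POD=$(oc get pods -l deploymentconfig=docker-registry \
  -o jsonpath='{.items[0].metadata.name}')

# Filesystem-level view of the registry volume.
oc exec "$REGISTRY_POD" -- df -k /registry

# Per-repository breakdown of what is actually stored.
oc exec "$REGISTRY_POD" -- du -sk /registry/docker/registry/v2/repositories
```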

@gabemontero
Contributor Author

[testonlyextended][extended:core(builds)]

@gabemontero gabemontero self-assigned this Apr 21, 2016
@bparees
Contributor

bparees commented Apr 21, 2016

lgtm, needs @smarterclayton approval. And let's make sure we have a plan to strip all this out (aside from the waitforabuild calls, which seem like a good general improvement) once we get to the bottom of this.

@smarterclayton
Contributor

Why do we need to merge this vs just running testonlyextended?

@gabemontero
Contributor Author

@smarterclayton I'm in agreement with not merging at this time, especially given the latest test results. The df/du of the registry pod, along with the df of the ec2 instance, shed some light on the situation, though the root cause is still TBD.

Specifically:

  1. The registry fs does in fact seem to be allocated the entire fs from the host system, as @bparees stipulated earlier today. I'll paste a couple of relevant entries here:

ec2 df:
/dev/mapper/docker--vg-openshift--xfs--vol--dir 1038336 295328 743008 29% /mnt/openshift-xfs-vol-dir

registry df:
/dev/mapper/docker--vg-openshift--xfs--vol--dir 1038336 295328 743008 29% /registry

Those two entries were immediately after a push to the registry failed with the same message reported earlier:

time="2016-04-21T21:51:06.639377806Z" level=error msg="response completed with error" err.code=UNKNOWN err.detail="filesystem: write /registry/docker/registry/v2/repositories/extended-test-build-image-source-eoxy7-hoeui/imagesourceapp/_uploads/30d40477-3875-4f92-b607-1e6b418f4d79/data: disk quota exceeded" err.message="unknown error" go.version=go1.6 http.request.host="172.30.131.224:5000" http.request.id=941f71c2-a05f-4830-a6d7-cb885abcb54f http.request.method=PATCH http.request.remoteaddr="172.18.12.132:34979" http.request.uri="/v2/extended-test-build-image-source-eoxy7-hoeui/imagesourceapp/blobs/uploads/30d40477-3875-4f92-b607-1e6b418f4d79?_state=KKbjEmD56jWJfd8GXI1LpPcAM2QkR8GiEg2V9UrfdlN7Ik5hbWUiOiJleHRlbmRlZC10ZXN0LWJ1aWxkLWltYWdlLXNvdXJjZS1lb3h5Ny1ob2V1aS9pbWFnZXNvdXJjZWFwcCIsIlVVSUQiOiIzMGQ0MDQ3Ny0zODc1LTRmOTItYjYwNy0xZTZiNDE4ZjRkNzkiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMTYtMDQtMjFUMjE6NTA6NTguMTkxMzI5MjVaIn0%3D" http.request.useragent="docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=8.298964494s http.response.status=500 http.response.written=293 instance.id=63aa5205-b8fb-4cf4-ad4e-12d25be0054b vars.name="extended-test-build-image-source-eoxy7-hoeui/imagesourceapp" vars.uuid=30d40477-3875-4f92-b607-1e6b418f4d79

The du /registry output on the registry pod matched up with the data from df. A mixture of image entries, secret entries, etc. A decent number of files, none overly big.

So either the entry that causes the problem is exceptionally large, and is then cleaned up as part of error processing, hence gone before we can run df/du, or the message details about disk quota being violated are misleading somehow (or we are misinterpreting what it means).

As @bparees and I discussed earlier, I think we've done due diligence here. It is time to pull in the team that owns the registry, and get their help in deciphering what is going on. I'm assuming we can add whatever debug they would need into this PR, and rerun the extended test.

Thoughts?

@gabemontero
Contributor Author

@pweil- during the build/image extended tests (both with the periodic jobs from https://ci.openshift.redhat.com/jenkins/job/origin_extended_build_tests , as well as individual PRs like this one) we have been getting an intermittent 500 on pushing images to the internal registry.

When examining the registry logs, we see messages like the ones noted in my prior comment at times corresponding to the push failures in the builder logs. However, while the message implies a disk space / quota issue, our examination of the file system shows plenty of space.

Are the logs for the last test run sufficient for your team to provide some insight? Or do we need to re-run with some additional debug?

thanks

@pweil-
Contributor

pweil- commented Apr 25, 2016

I believe that @soltysh added some additional logging specifically for docker. @soltysh can you take a look here?

@soltysh
Contributor

soltysh commented Apr 25, 2016

Looking...

@gabemontero gabemontero force-pushed the buildExtTestDbg branch 2 times, most recently from 26c0262 to cc744a9 on April 26, 2016 15:04
@gabemontero
Contributor Author

hey @soltysh - have you had a chance to repro / look at this any? Is there some debug you would like me to add to the PR and retest?

@soltysh
Contributor

soltysh commented Apr 28, 2016

@gabemontero sorry, didn't get to it yet. I wanted to finish job-related stuff first. Hopefully I'll have some time tomorrow to look at it.

@gabemontero
Contributor Author

Cool, no worries @soltysh


@gabemontero
Contributor Author

So a little internet research turned up a docker support thread that, while unresolved, shows some similarities to what we are seeing: https://meta.discourse.org/t/disk-quota-exceeded-when-chown-ing/36500/20

Based on this and a couple of other similar hits, I'm going to add a call to quota along with the df usage in the debug I added and see if it sheds any light.

If not, @csrwng had the suggestion that maybe selinux is playing a part. Perhaps I'll see about disabling it in the PR test runs and see if the behavior changes.
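Roughly, the extra checks being considered here would look something like the following sketch (assumes the quota binary is available in the registry image, and REGISTRY_POD resolved as in the earlier sketch), not the exact debug code in the PR:

```bash
# Resolve the registry pod name (label selector is an assumption).
REGISTRY_POD=$(oc get pods -l deploymentconfig=docker-registry \
  -o jsonpath='{.items[0].metadata.name}')

# Report any user quotas visible inside the registry container.
oc exec "$REGISTRY_POD" -- quota -v

# Check whether SELinux is enforcing on the test host.
getenforce
```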

@gabemontero gabemontero force-pushed the buildExtTestDbg branch 6 times, most recently from 6939a7c to 3057cde on May 2, 2016 21:13
@gabemontero
Contributor Author

Things attempted / found today:

  1. after adding the quota command to the openshift registry image, running it in the registry container after the error provides no data; it simply returns
  2. running getenforce on the AWS instance hosting the tests reports it is already in "Permissive" mode
  3. I can manually create directories in the /registry directory in the docker registry pod via mkdir after the error occurs
  4. However, an attempt to create a directory in /registry/docker/registry/v2/repositories after the error occurs via mkdir in the registry pod results in the same disk quota exceeded error.
  5. I even hacked docker to attempt a mkdir when the golang os.MkdirAll failed, and got the same disk quota exceeded error.

Still think I need to figure out how to get a report of quota-level stats and/or config in the registry container ... might next look at dumping the contents of /etc/mtab or /proc/self/mounts (if either in fact exists in the registry container) or whatever the right spot is in this scenario.
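One way that next round of inspection could look, assuming those files exist inside the registry container (a sketch, not the committed debug code):

```bash
# Resolve the registry pod name (label selector is an assumption).
REGISTRY_POD=$(oc get pods -l deploymentconfig=docker-registry \
  -o jsonpath='{.items[0].metadata.name}')

# How the /registry volume is mounted as seen from inside the container.
oc exec "$REGISTRY_POD" -- cat /proc/self/mounts

# Fallback if /proc/self/mounts is not informative.
oc exec "$REGISTRY_POD" -- cat /etc/mtab
```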

@gabemontero gabemontero force-pushed the buildExtTestDbg branch 2 times, most recently from dfdde46 to bcdcf06 on May 4, 2016 19:14
@gabemontero
Contributor Author

Some more updates:

  • the inconsistency between being able to create directories in /registry vs. /registry/docker/registry/v2/repositories ... most likely I misinterpreted the earlier results; consistent behavior across an FS makes more sense.
  • not entirely confident that selinux is not coming into play ... based on this doc I've tried examining /var/log/audit and running ausearch (both from the AWS host and from within the registry container, see the sketch below), but no errors turned up.
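The audit-log check referenced in that bullet amounts to something like the following, run on the AWS host and assuming auditd is active:

```bash
# Look for recent SELinux AVC denials that could explain the quota-like failure.
sudo ausearch -m avc -ts recent

# Or grep the raw audit log directly.
sudo grep denied /var/log/audit/audit.log
```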

Definitely feel like we've moved past my linux/AWS expertise and possibly to the point of diminishing returns in trying to uncover the disk quota exceeded behavior.

Before punting and opening an issue/bugzilla to officially track this with the platform mgmt team, I'm going to try again to look at pruning images from the registry between tests and see if that keeps us under this mysterious disk quota.

@smarterclayton
Contributor

Note that we run conformance tests with an XFS quota on empty dir.
@dgoodwin


@gabemontero
Contributor Author

Thanks for the confirmation. Still wonder why the 'quota' cmd returns blank. A quick search on xfs quota turned up 'report' as well. I'll try adding an invocation of that.


@gabemontero
Contributor Author

Well, xfs_quota -xc report at least returns something at the end of the run when executed on the AWS test host, though nothing that would indicate disk quotas being defined or contention on said resources:

Group quota on /mnt/openshift-xfs-vol-dir (/dev/mapper/docker--vg-openshift--xfs--vol--dir)
                     Blocks
Group ID         Used   Soft   Hard  Warn/Grace
ec2-user            0      0      0  00 [--------]

Group quota on /mnt/openshift-xfs-vol-dir (/dev/mapper/docker--vg-openshift--xfs--vol--dir)
                     Blocks
Group ID         Used   Soft   Hard  Warn/Grace
ec2-user            0      0      0  00 [--------]

Running the same command from the extended test when the error is reported returns nothing.

@dgoodwin @smarterclayton do we perhaps have some sort of xfs_quota or docker emptydir problem?
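One more check that might narrow down whether this is an XFS group-quota / emptyDir interaction (a sketch, run on the AWS host; mount point taken from the report above):

```bash
# Confirm whether group-quota accounting/enforcement is on for the volume.
mount | grep openshift-xfs-vol-dir        # look for gquota/grpquota options
sudo xfs_quota -x -c 'state -g' /mnt/openshift-xfs-vol-dir

# Full per-group usage report with human-readable block counts.
sudo xfs_quota -x -c 'report -gh' /mnt/openshift-xfs-vol-dir
```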

@gabemontero gabemontero force-pushed the buildExtTestDbg branch 3 times, most recently from 76db393 to d01f79d on May 6, 2016 16:34
@gabemontero
Contributor Author

@bparees @dgoodwin - PTAL at the proposed changes, I'd like to get these merged and move on to the next set of issues with the build/images extended tests.

On the xfs_quota-related change, I backed off a bit on the size of the increase (went from 256 Mi to 896 Mi rather than the full 1 Gi the ec2 instances have allocated), so in theory we should still see the quota stop things before the FS is actually fully consumed.
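For illustration only, a group block-quota bump of that shape would look roughly like this; the group name below is a hypothetical placeholder, and the real change lives in the extended-test setup scripts, which may set the quota differently:

```bash
# Hypothetical: raise the soft/hard block quota to 896Mi for the group the
# emptyDir volume is charged against, leaving headroom under the ~1Gi volume.
sudo xfs_quota -x -c 'limit -g bsoft=896m bhard=896m somegroup' /mnt/openshift-xfs-vol-dir
```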

Also, @bparees @csrwng - the 2 tests with the internal git server in gitauth.go have now failed a couple of times in a row with the same error since we moved past the disk quota issue, and these are the only build extended tests that failed. That will be the next area of focus for my debugging, but I seem to recall from scrum some debug activity with the internal git server. Are my recollections correct / are there known issues that would affect these extended tests? If not, some background on why the URLs are constructed the way they are might be helpful.

thanks

@gabemontero
Contributor Author

I can reproduce the gitauth.go extended test errors locally.

@@ -48,25 +49,56 @@ func DumpBuildLogs(bc string, oc *CLI) {
fmt.Fprintf(g.GinkgoWriter, "\n\n got error on bld logs %v\n\n", err)
}

ExamineDiskUsage()
// if we suspect that we are filling up the registry file system, call ExamineDiskUsage / ExaminePodDiskUsage
// also see if manipulations of the quota around /mnt/openshift-xfs-vol-dir exist in the extended test setup scripts
Contributor

this would be one case where i'd leave the call in place, commented out.

Contributor Author

add calls commented out

@bparees
Contributor

bparees commented May 6, 2016

@gabemontero one nit and then i'd be ok w/ merging this if that's what you want.

@openshift-bot
Contributor

Evaluated for origin testonlyextended up to 5be0a98

@gabemontero
Contributor Author

@bparees yeah, let's go ahead and merge ... aside from wanting to see the results for the build extended tests on the official job, I'm curious about the delta in behavior on the images extended tests (something seems off with my PR for the images side).

We'll work these gitauth.go failures in a separate PR. They show up locally anyway, so they are just broken rather than being impacted by the env.

@csrwng
Contributor

csrwng commented May 6, 2016

The problem with the gitserver tests is that they rely on the router and .xip.io to test ssl auth:
F0506 17:19:14.732361 1 builder.go:204] Error: build error: fatal: unable to access 'https://gitserver.172.18.2.53.xip.io/ruby-hello-world/': Could not resolve host: gitserver.172.18.2.53.xip.io; Name or service not known

It seems that .xip.io resolution has been flakier than usual lately
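For what it's worth, a quick check that would confirm name resolution (rather than the route or certificate) is what fails; hostname taken from the failing build log above:

```bash
# Both of these will fail intermittently if xip.io itself is flaking,
# even when the router and the route are healthy.
getent hosts gitserver.172.18.2.53.xip.io
dig +short gitserver.172.18.2.53.xip.io
```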

@gabemontero
Contributor Author

Yep that was where I was at ... where I was headed was simply pulling the IP/port from the service and be done with it.

Or do we really care about the name resolution?


@csrwng
Contributor

csrwng commented May 6, 2016

We need it if we want to test secured communication to the server (via SSL)... only possible through the router currently

@gabemontero
Contributor Author

The router logs have this (and something similar for the gitserver-tokenauth service):

W0505 21:01:11.074032 1 router.go:572] a edge terminated route with host gitserver.172.18.5.125.xip.io does not have the required certificates. The route will still be created but no certificates will be written

Don't know yet if this is benign or not.

@bparees
Contributor

bparees commented May 6, 2016

lgtm, thanks for the deep dive on this @gabemontero
[merge]

@openshift-bot
Contributor

openshift-bot commented May 6, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5841/) (Image: devenv-rhel7_4131)

@openshift-bot
Contributor

Evaluated for origin merge up to 5be0a98

@openshift-bot
Contributor

[Test]ing while waiting on the merge queue

@openshift-bot
Contributor

Evaluated for origin test up to 5be0a98

@gabemontero
Contributor Author

NP @bparees

On the route, based on the comments on this method, we might possibly expect a connection error, not a resolve-host error.

Going to hit pause for now, wait for this PR to merge, and then see about pulling in the networking team / brainstorming during scrum on Monday.

@csrwng
Contributor

csrwng commented May 6, 2016

@gabemontero If what you're seeing is this:
https://github.com/openshift/origin/blob/master/pkg/router/template/router.go#L570
that is expected. The default route certificate is a wildcard certificate. I don't think that's the issue. The problem is that the .xip.io address is not getting resolved. The fix would be to come up with an alternative, more stable dns server. I'll investigate if we can bring up our own proxy server that resolves the one host we need.

@gabemontero
Contributor Author

@csrwng yep that is where I was landing wrt that certificate based log. The use of the openshift proxy server as an alternative sounds cool. Consider the baton officially passed to you then until you get a chance to investigate :-)

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/testonlyextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/162/) (Extended Tests: core(builds))

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/3654/)

@openshift-bot openshift-bot merged commit 91e2f0b into openshift:master May 7, 2016
@gabemontero gabemontero deleted the buildExtTestDbg branch May 7, 2016 16:23
@soltysh
Contributor

soltysh commented May 9, 2016

@gabemontero awesome work!!! Thank you! And sorry I wasn't able to get into that...

@gabemontero
Contributor Author

Thanks @soltysh - and no worries - it was educational ;-)
